October 5, 2015

Setting Expectations

Expect to:

  1. Get a sense of what is possible with R.
  2. Set up important frameworks around how to do data science (forewarning - I'm biased!)
  3. Be warned of potential minefields

Do Not Expect to:

  1. Completely understand all the code presented on your first try.
  2. Be exposed to all of the features of the mentioned packages.

The Data Scientist's Workflow

R Solutions

Case Study Overview

Goal: Reproduce the final pharmacogenomic regression model and create an interactive web dose calculator.


This study used clinical and genetic data from a broad population to estimate appropriate warfarin dose for patients newly starting warfarin.

The main data set for the study is available on PharmGKB.

Data Set

This data set is in an excel file and presents some challenges:

  • Column names with odd symbols
  • Data types change with different project sites

For the purpose of this tutorial I have reduced the complexity of the data set and converted it to a tab-delimited file available on my GitHub. We will use read.delim() to deal with the odd column names.

iwpc_data <- read.delim(file = "iwpc_data_7_3_09_revised3.txt")
iwpc_data %<>% tbl_df()

Checking out our Data

iwpc_data %>% 
  group_by(Project.Site) %>% 
  sample_n(1) %>% 
  datatable(rownames = FALSE, options = list(pageLength=2, scrollX = TRUE))

Tidying our Data - Column Names

iwpc_data %<>% 
  rename(subject_id = PharmGKB.Subject.ID,
         sample_id = PharmGKB.Sample.ID,
         project_site = Project.Site,
         gender = Gender,
         race_reported = Race..Reported.,
         race_omb = Race..OMB.,
         ethnicity_reported = Ethnicity..Reported.,
         ethnicitiy_omb = Ethnicity..OMB.,
         age = Age,
         height = Height..cm.,
         weight = Weight..kg.,
         indication = Indication.for.Warfarin.Treatment,
         comorbidities = Comorbidities,
         medications = Medications,
         target_inr = Target.INR,
         target_inr_estimated = Estimated.Target.INR.Range.Based.on.Indication,
         reached_stable_dose = Subject.Reached.Stable.Dose.of.Warfarin,
         therapeutic_warfarin_dose = Therapeutic.Dose.of.Warfarin,
         inr_on_warfarin = INR.on.Reported.Therapeutic.Dose.of.Warfarin,
         smoker = Current.Smoker,
         cyp2c9_consensus = CYP2C9.consensus,
         vkorc1_1639_consensus = VKORC1..1639.consensus)

Data Manipulation - Which Variables?

They Used:

  • Age in decades = 1 for 10-19, etc…
  • VKORC1 G/A = 1 if heterozygous
  • VKORC1 A/A = 1 if homozygous for A
  • VKORC1 genotype unknown = 1
  • CYP2C9 *1/*2 = 1 if *1/*2
  • CYP2C9 *1/*3 = 1 if *1/*3
  • CYP2C9 *2/*2 = 1 if homozygous *2
  • CYP2C9 *2/*3 = 1 if *2/*3
  • CYP2C9 *3/*3 = 1 if homozygous *3
  • CYP2C9 genotype unknown = 1
  • Asian Race = 1
  • Black/African American = 1
  • Missing or Mixed race = 1


  • Amiodarone status = 1
  • Enzyme inducer status = 1 if any of: rifampin, carbamazepine, phenytoin, rifampicin

We Have:

  • Age: 10-19, 20-29, 30-39 etc.
  • VKORC1: A/A, A/G, G/G
  • CYP2C9: combinations of: *1, *2, *3, *5, *6, *8, *11, etc.
  • Race: Asian, Black or African America, White, Other
  • Medications: character list of medications, lack of medications, etc.

Data Manipulation: Dummy Code Race

iwpc_data %<>% 
  mutate(asian = ifelse(str_detect(race_omb, "Asian"),
                        yes = 1,
                        no = 0),
         african_american = ifelse(str_detect(race_omb, "Black or African American"),
                                   yes = 1, 
                                   no = 0),
         missing_or_mixed_race = ifelse(str_detect(race_omb, "Unknown"),
                                        yes = 1,
                                        no = 0))
Race OMB Asian African American Missing/Mixed Race N
Asian 1 0 0 1634
Black or African American 0 1 0 462
Unknown 0 0 1 482
White 0 0 0 3122

Data Manipulation: VKORC1

iwpc_data %>% 
  mutate(vkorc1_1639_ag = ifelse(str_detect(vkorc1_1639_consensus,"A/G"),
                                 yes = 1, no = 0),
         vkorc1_1639_aa = ifelse(str_detect(vkorc1_1639_consensus, "A/A"),
                                 yes = 1,no = 0),
         vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
                                      yes = 1,no = 0))
VKORC1 1639 VKORC1 A/G VKORC1 A/A VKORC1 Unknown N
A/A 0 1 0 1485
A/G 1 0 0 1470
G/G 0 0 0 1246
NA NA NA 1 1499

Warning: Running str_detect() on NA returns NA. If you don't know that data value, R doesn't know if there's a match!

Data Manipulation: VKORC1

iwpc_data %<>% 
  mutate(vkorc1_1639_ag = ifelse(is.na(vkorc1_1639_consensus) | 
                                   !str_detect(vkorc1_1639_consensus,"A/G"),
                                 yes = 0,  no = 1),
         vkorc1_1639_aa = ifelse(is.na(vkorc1_1639_consensus) | 
                                   !str_detect(vkorc1_1639_consensus, "A/A"),
                                 yes = 0, no = 1),
         vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
                                      yes = 1, no = 0))
VKORC1 1639 VKORC1 A/G VKORC1 A/A VKORC1 Unknown N
A/A 0 1 0 1485
A/G 1 0 0 1470
G/G 0 0 0 1246
NA 0 0 1 1499

Data Manipulation: CYP2C9

iwpc_data %<>% 
  mutate(cyp2c9_1_2 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*1/\\*2"),
                             yes = 0, no = 1),
         cyp2c9_1_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*1/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_2_2 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*2/\\*2"),
                             yes = 0, no = 1),
         cyp2c9_2_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*2/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_3_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*3/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_unknown = ifelse(is.na(cyp2c9_consensus),
                                 yes = 1,no = 0))

Data Manipulation: CYP2C9

CYP2C9 1/2 1/3 2/2 2/3 3/3 Unknown N
1/1 0 0 0 0 0 0 4157
1/11 0 0 0 0 0 0 6
1/13 0 0 0 0 0 0 1
1/14 0 0 0 0 0 0 1
1/2 1 0 0 0 0 0 737
1/3 0 1 0 0 0 0 498
1/5 0 0 0 0 0 0 6
1/6 0 0 0 0 0 0 3
2/2 0 0 1 0 0 0 56
2/3 0 0 0 1 0 0 69
3/3 0 0 0 0 1 0 22
NA 0 0 0 0 0 1 144

Data Manipulation: Medications

iwpc_data %<>% mutate(medications = tolower(medications))
iwpc_data %>% filter(str_detect(medications, "amiodarone")) %>% count(medications) %>% datatable()

Data Manipulation: Amiodarone

iwpc_data %>% filter(str_detect(medications, "amiodarone")) %>% count(medications)
## Source: local data frame [178 x 2]
## 
##                                                                    medications
##                                                                          (chr)
## 1  acetaminophen; albuterol; amiodarone; aspirin; bisacodyl; glyburide; ipratr
## 2  acetaminophen; amiodarone;  hcl; digoxin; diltiazem hcl; lansoprazole; levo
## 3  albuterol; atrovent; combivent; amitriptyline; benzapril; advir; docusate; 
## 4  altace; ambien; amiodarone; celebrex; clonazepam; coreg; coumadin; flomax; 
## 5                                                                   amiodarone
## 6                    amiodarone hcl; digoxin; furosemide; lisinopril; tramadol
## 7  amiodarone; ;aspirin; atenolol;  calcium; digoxin; fosinopril sodium; furos
## 8  amiodarone; ;aspirin; digoxin; docusate sodium; furosemide; glimepiride; ir
## 9                                                    amiodarone; acetaminophen
## 10 amiodarone; amoxicillin; fosamax; mvi; vitamin b; vitamin d; vitamin e; war
## ..                                                                         ...
## Variables not shown: n (int)
iwpc_data %>% filter(str_detect(medications, "amiodarone")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1  1415
iwpc_data %>% filter(str_detect(medications, "(^|;)[a-z ]*amiodarone")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1  1415
iwpc_data %>% mutate(medications = str_extract(medications, "(^|;)[a-z ]*amiodarone")) %>% count(medications)
## Source: local data frame [6 x 2]
## 
##                           medications     n
##                                 (chr) (int)
## 1                        ; amiodarone   168
## 2                     ; no amiodarone   190
## 3                    ; not amiodarone   967
## 4                          amiodarone    86
## 5 cordarone or pacerone or amiodarone     4
## 6                                  NA  4285
iwpc_data %>% mutate(medications = str_extract(medications, "(^|;)[a-z ]*amiodarone"),amiodarone_test = str_detect(medications, "(; amiodarone)|(^[a-z ]*amiodarone)")) %>% count(medications,amiodarone_test)
## Source: local data frame [6 x 3]
## Groups: medications [?]
## 
##                           medications amiodarone_test     n
##                                 (chr)           (lgl) (int)
## 1                        ; amiodarone            TRUE   168
## 2                     ; no amiodarone           FALSE   190
## 3                    ; not amiodarone           FALSE   967
## 4                          amiodarone            TRUE    86
## 5 cordarone or pacerone or amiodarone            TRUE     4
## 6                                  NA              NA  4285
iwpc_data %>% mutate(medications_temp = str_extract(medications, "(^|;)[a-z ]*amiodarone"), amiodarone_test = str_detect(medications, "(; amiodarone)|(^[a-z ]*amiodarone)"), test = ifelse(is.na(medications)|!str_detect(medications, "(; amiodarone)|(^[a-z ]*amiodarone)"),yes = 0,no = 1)) %>% count(medications_temp,amiodarone_test,test)
## Source: local data frame [7 x 4]
## Groups: medications_temp, amiodarone_test [?]
## 
##                      medications_temp amiodarone_test  test     n
##                                 (chr)           (lgl) (dbl) (int)
## 1                        ; amiodarone            TRUE     1   168
## 2                     ; no amiodarone           FALSE     0   190
## 3                    ; not amiodarone           FALSE     0   967
## 4                          amiodarone            TRUE     1    86
## 5 cordarone or pacerone or amiodarone            TRUE     1     4
## 6                                  NA           FALSE     0  2892
## 7                                  NA              NA     0  1393
iwpc_data %<>% mutate(amiodarone = ifelse(is.na(medications)|!str_detect(medications, "(; amiodarone)|(^[a-z ]*amiodarone)"),yes = 0,no = 1))

Data Manipulation: Enzyme Enducers

#### Carbamazepine
iwpc_data %>% filter(str_detect(medications,"carbamazepine")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   176
iwpc_data %>% filter(str_detect(medications,"(^|;)[a-z ]*carbamazepine")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   176
iwpc_data %>% mutate(medications_temp = str_extract(medications,"(^|;)[a-z ]*carbamazepine")) %>% count(medications_temp)
## Source: local data frame [5 x 2]
## 
##      medications_temp     n
##                 (chr) (int)
## 1    ;  carbamazepine     1
## 2     ; carbamazepine    12
## 3 ; not carbamazepine   160
## 4       carbamazepine     3
## 5                  NA  5524
iwpc_data %<>% mutate(carbamazepine = ifelse(is.na(medications)|!str_detect(medications,"(^|;)[a-z ]*carbamazepine"),yes = 0,no = 1))

#### Phenytoin
iwpc_data %>% filter(str_detect(medications,"phenytoin")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   179
iwpc_data %>% filter(str_detect(medications,"(^|;)[a-z ]*phenytoin")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   179
iwpc_data %>% mutate(medications_temp = str_extract(medications,"(^|;)[a-z ]*phenytoin")) %>% count(medications_temp)
## Source: local data frame [4 x 2]
## 
##   medications_temp     n
##              (chr) (int)
## 1  ; not phenytoin   160
## 2      ; phenytoin    17
## 3        phenytoin     2
## 4               NA  5521
iwpc_data %<>% mutate(phenytoin = ifelse(is.na(medications)|!str_detect(medications,"(^|;)[a-z ]*phenytoin"),yes = 0,no = 1))

#### rifampin
iwpc_data %>% filter(str_detect(medications,"rifampin")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     3
iwpc_data %>% filter(str_detect(medications,"(^|;)[a-z ]*rifampin")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     3
iwpc_data %>% mutate(medications_temp = str_extract(medications,"(^|;)[a-z ]*rifampin")) %>% count(medications_temp)
## Source: local data frame [3 x 2]
## 
##   medications_temp     n
##              (chr) (int)
## 1       ; rifampin     2
## 2         rifampin     1
## 3               NA  5697
iwpc_data %<>% mutate(rifampin = ifelse(is.na(medications)|!str_detect(medications,"(^|;)[a-z ]*rifampin"),yes = 0,no = 1))

#### rifampicin
iwpc_data %>% filter(str_detect(medications,"rifampicin")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     0
iwpc_data %>% filter(str_detect(medications,"(^|;)[a-z ]*rifampicin")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     0
iwpc_data %>% mutate(medications_temp = str_extract(medications,"(^|;)[a-z ]*rifampicin")) %>% count(medications_temp)
## Source: local data frame [1 x 2]
## 
##   medications_temp     n
##              (chr) (int)
## 1               NA  5700
iwpc_data %<>% mutate(rifampicin = ifelse(is.na(medications)|!str_detect(medications,"(^|;)[a-z ]*rifampicin"),yes = 0,no = 1))

#### enzyme
iwpc_data %>% mutate(enzyme_enducer = ifelse((carbamazepine+phenytoin+rifampin+rifampicin)>0,yes = 1,no = 0)) %>% count(carbamazepine,phenytoin,rifampin,rifampicin,enzyme_enducer)
## Source: local data frame [5 x 6]
## Groups: carbamazepine, phenytoin, rifampin, rifampicin [?]
## 
##   carbamazepine phenytoin rifampin rifampicin enzyme_enducer     n
##           (dbl)     (dbl)    (dbl)      (dbl)          (dbl) (int)
## 1             0         0        0          0              0  5503
## 2             0         0        1          0              1     3
## 3             0         1        0          0              1    18
## 4             1         0        0          0              1    15
## 5             1         1        0          0              1   161
iwpc_data %<>% mutate(enzyme_enducer = ifelse((carbamazepine+phenytoin+rifampin+rifampicin)>0,yes = 1,no = 0))

Data Manipulation: Age

iwpc_data %>% count(age)
## Source: local data frame [10 x 2]
## 
##        age     n
##     (fctr) (int)
## 1   19-Oct    14
## 2  20 - 29   130
## 3  30 - 39   230
## 4  40 - 49   540
## 5  50 - 59  1085
## 6  60 - 69  1384
## 7  70 - 79  1570
## 8  80 - 89   670
## 9      90+    35
## 10      NA    42
iwpc_data %>% count(age, substr(age,1,1), as.numeric(substr(age,1,1)))
## Source: local data frame [10 x 4]
## Groups: age, substr(age, 1, 1) [?]
## 
##        age substr(age, 1, 1) as.numeric(substr(age, 1, 1))     n
##     (fctr)             (chr)                         (dbl) (int)
## 1   19-Oct                 1                             1    14
## 2  20 - 29                 2                             2   130
## 3  30 - 39                 3                             3   230
## 4  40 - 49                 4                             4   540
## 5  50 - 59                 5                             5  1085
## 6  60 - 69                 6                             6  1384
## 7  70 - 79                 7                             7  1570
## 8  80 - 89                 8                             8   670
## 9      90+                 9                             9    35
## 10      NA                NA                            NA    42
iwpc_data %>% mutate(age_decades = as.numeric(substr(age,1,1)))
## Source: local data frame [5,700 x 41]
## 
##     subject_id   sample_id project_site gender race_reported race_omb
##         (fctr)      (fctr)        (int) (fctr)        (fctr)   (fctr)
## 1  PA135312261 PA135312629            1   male         White    White
## 2  PA135312262 PA135312630            1 female         White    White
## 3  PA135312263 PA135312631            1 female         White    White
## 4  PA135312264 PA135312632            1   male         White    White
## 5  PA135312265 PA135312633            1   male         White    White
## 6  PA135312266 PA135312634            1   male         White    White
## 7  PA135312267 PA135312635            1   male         White    White
## 8  PA135312268 PA135312636            1   male         White    White
## 9  PA135312269 PA135312637            1 female         White    White
## 10 PA135312270 PA135312638            1   male         White    White
## ..         ...         ...          ...    ...           ...      ...
## Variables not shown: ethnicity_reported (fctr), ethnicitiy_omb (fctr), age
##   (fctr), height (dbl), weight (dbl), indication (fctr), comorbidities
##   (fctr), medications (chr), target_inr (dbl), target_inr_estimated
##   (fctr), reached_stable_dose (int), therapeutic_warfarin_dose (dbl),
##   inr_on_warfarin (dbl), smoker (int), cyp2c9_consensus (fctr),
##   vkorc1_1639_consensus (fctr), asian (dbl), african_american (dbl),
##   missing_or_mixed_race (dbl), vkorc1_1639_ag (dbl), vkorc1_1639_aa (dbl),
##   vkorc1_1639_unknown (dbl), cyp2c9_1_2 (dbl), cyp2c9_1_3 (dbl),
##   cyp2c9_2_2 (dbl), cyp2c9_2_3 (dbl), cyp2c9_3_3 (dbl), cyp2c9_unknown
##   (dbl), amiodarone (dbl), carbamazepine (dbl), phenytoin (dbl), rifampin
##   (dbl), rifampicin (dbl), enzyme_enducer (dbl), age_decades (dbl)

Data Visualization: Warfarin Dose

iwpc_data %>% count(therapeutic_warfarin_dose)
## Source: local data frame [565 x 2]
## 
##    therapeutic_warfarin_dose     n
##                        (dbl) (int)
## 1                       2.10     1
## 2                       2.50     1
## 3                       3.50     8
## 4                       4.50     2
## 5                       4.90     1
## 6                       5.00     1
## 7                       5.25     9
## 8                       5.53     1
## 9                       5.60     1
## 10                      5.81     1
## ..                       ...   ...
iwpc_data %>% ggplot(aes(x=1,y = therapeutic_warfarin_dose)) + geom_boxplot()
## Warning: Removed 172 rows containing non-finite values (stat_boxplot).

iwpc_data %>% ggplot(aes(x=1,y = sqrt(therapeutic_warfarin_dose))) + geom_boxplot()
## Warning: Removed 172 rows containing non-finite values (stat_boxplot).

Data Manipulation: Warfarin Dose

iwpc_data %>% count(reached_stable_dose)
## Source: local data frame [3 x 2]
## 
##   reached_stable_dose     n
##                 (int) (int)
## 1                   0   226
## 2                   1  5425
## 3                  NA    49
iwpc_data %<>% filter(reached_stable_dose!=0)

iwpc_data %<>% mutate(sqrt_warfarin_dose = sqrt(therapeutic_warfarin_dose))

Modeling

Model Tidying

Model Visualization

Web App for Warfarin Dose